Head and summary of the Nhanes 2015/2016 dataset

knitr::kable(head(nhanes2))
SEQN Alcohol_Year Smoked_100 Gender Age Race Education Marital_Status Household_Size Income_to_Pov Systolic_BP1 Diastolic_BP1 Systolic_BP2 Diastolic_BP2 BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST HIQ210 EducationX agegroup
83732 Yes Yes Male 62 White 5 Married 2 4.39 128 70 124 64 94.8 184.5 27.8 43.3 43.6 35.9 101.1 2 College/Uni graduate or above 60-69
83733 Yes Yes Male 53 White 3 Divorced 1 1.32 146 88 140 88 90.4 171.4 30.8 38.0 40.0 33.2 107.9 NA High school graduate 50-59
83734 Yes Yes Male 78 White 3 Married 2 1.51 138 46 132 44 83.4 170.1 28.8 35.6 37.0 31.0 116.5 2 High school graduate 70-80+
83735 No No Female 56 White 5 Living with partner 1 5.00 132 72 134 68 109.8 160.9 42.4 38.5 37.7 38.3 110.1 2 College/Uni graduate or above 50-59
83736 No No Female 42 Black 4 Divorced 5 1.23 100 70 114 54 55.2 164.9 20.3 37.4 36.0 27.2 80.4 2 Some college/Uni 40-49
83737 No No Female 72 Mexican American 2 Separated 5 2.82 116 58 122 58 64.4 150.0 28.6 34.4 33.5 31.4 92.9 NA 9-11th grade 70-80+
summary(nhanes2)
##       SEQN           Alcohol_Year       Smoked_100      Gender    
##  Min.   :83732   Don't know:   3   Don't know:   8   Female:2976  
##  1st Qu.:86164   No        :1728   No        :3406   Male  :2759  
##  Median :88668   Yes       :3477   Refused   :   2                
##  Mean   :88679   NA's      : 527   Yes       :2319                
##  3rd Qu.:91178                                                    
##  Max.   :93702                                                    
##                                                                   
##       Age                      Race        Education    
##  Min.   :18.00   Black           :1227   Min.   :1.000  
##  1st Qu.:32.00   Mexican American:1018   1st Qu.:3.000  
##  Median :48.00   Other Hispanic  : 750   Median :4.000  
##  Mean   :48.05   Other Race      : 901   Mean   :3.442  
##  3rd Qu.:63.00   White           :1839   3rd Qu.:4.750  
##  Max.   :80.00                           Max.   :9.000  
##                                          NA's   :261    
##              Marital_Status Household_Size Income_to_Pov    Systolic_BP1  
##  Married            :2780   1: 770         Min.   :0.000   Min.   : 82.0  
##  Never Married      :1004   2:1546         1st Qu.:1.060   1st Qu.:112.0  
##  Divorced           : 579   3:1037         Median :1.980   Median :122.0  
##  Living with partner: 527   4: 936         Mean   :2.403   Mean   :125.1  
##  Widowed            : 396   5: 699         3rd Qu.:3.740   3rd Qu.:134.0  
##  (Other)            : 188   6: 379         Max.   :5.000   Max.   :236.0  
##  NA's               : 261   7: 368         NA's   :601     NA's   :334    
##  Diastolic_BP1     Systolic_BP2   Diastolic_BP2        BMXWT       
##  Min.   :  0.00   Min.   : 84.0   Min.   :  0.00   Min.   : 32.40  
##  1st Qu.: 62.00   1st Qu.:112.0   1st Qu.: 62.00   1st Qu.: 65.90  
##  Median : 70.00   Median :122.0   Median : 70.00   Median : 78.20  
##  Mean   : 69.52   Mean   :124.8   Mean   : 69.35   Mean   : 81.34  
##  3rd Qu.: 78.00   3rd Qu.:134.0   3rd Qu.: 78.00   3rd Qu.: 92.70  
##  Max.   :120.00   Max.   :238.0   Max.   :144.00   Max.   :198.90  
##  NA's   :334      NA's   :200     NA's   :200      NA's   :69      
##      BMXHT           BMXBMI          BMXLEG         BMXARML     
##  Min.   :129.7   Min.   :14.50   Min.   :26.00   Min.   :28.20  
##  1st Qu.:158.7   1st Qu.:24.30   1st Qu.:36.00   1st Qu.:35.20  
##  Median :166.0   Median :28.30   Median :38.60   Median :37.10  
##  Mean   :166.1   Mean   :29.38   Mean   :38.58   Mean   :37.15  
##  3rd Qu.:173.5   3rd Qu.:33.00   3rd Qu.:41.20   3rd Qu.:39.00  
##  Max.   :202.7   Max.   :67.30   Max.   :51.50   Max.   :47.40  
##  NA's   :62      NA's   :73      NA's   :390     NA's   :308    
##     BMXARMC         BMXWAIST          HIQ210     
##  Min.   :17.10   Min.   : 58.70   Min.   :1.000  
##  1st Qu.:29.50   1st Qu.: 87.60   1st Qu.:2.000  
##  Median :32.70   Median : 98.30   Median :2.000  
##  Mean   :33.11   Mean   : 99.57   Mean   :1.915  
##  3rd Qu.:36.20   3rd Qu.:109.30   3rd Qu.:2.000  
##  Max.   :58.40   Max.   :171.60   Max.   :9.000  
##  NA's   :308     NA's   :367      NA's   :1003   
##                          EducationX     agegroup   
##  < 9th grade                  : 655   18-29 :1192  
##  9-11th grade                 : 643   30-39 : 933  
##  High school graduate         :1186   40-49 : 913  
##  Some college/Uni             :1621   50-59 : 888  
##  College/Uni graduate or above:1366   60-69 : 917  
##  9                            :   3   70-80+: 892  
##  NA's                         : 261

****3 functions created (saved in lib file). 1. ‘corr_func’ to quickly test correlations on the variables whilst exploring the data). The sole purpose using this rather than the built-in function is so the argument ‘complete.obs’ doesn’t have to be typed each time. 2. ‘grp_func’ to quickly allow the user to input a variable to group by and a variable to analyse which will output the min, max, and mean. 3. ‘prop_func’ takes the following aruguments: dataframe, catagorical group variable, and binary variable to analyse. Then outputs a new dataframe of proportions based on these arguments (split by gender). This will be used in the shiny dashboard.

Below code to firstly find out all the columns containing numeric data (numeric_vars. Then a new data table is created with these values (numeric_data). From this table, all the correlations were worked out wit the ‘cor’ function and saved under a variable names ‘correlations’. A correlation plot was created to clearly show any positive or negative correlations between these variables.

Group by agegrp to look at proportions of people who have smoked 100 cigarettes in thei’r life and people who haven’t. Assume that the the older generations will have a larger proportion of smokers than the youngers ones. Displayed in pie charts

smoke_df = filter(nhanes2, Smoked_100!="Don't know" & Smoked_100!='Refused')
prop = smoke_df %>%
  count(.data[['agegroup']], .data[['Smoked_100']], na.rm = TRUE) %>%
  mutate(prop = prop.table(n)) 
smoke = group_by(prop, agegroup) %>% transmute(Smoked_100, Proportion = n/sum(n))

p1 = smoke[1:2,]
p2 = smoke[3:4,]
p3 = smoke[5:6,]
p4 = smoke[7:8,]

pie1 = ggplot(p1, aes(x='', y=Proportion, fill =Smoked_100 )) + 
  geom_bar(stat='identity', width=1) + labs(title = 'Age - 18-29') + 
  coord_polar('y', start=0) + 
  geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
  scale_fill_brewer(palette = 'Dark2') +
  theme_void()

pie2 = ggplot(p2, aes(x='', y=Proportion, fill =Smoked_100 )) + 
  geom_bar(stat='identity', width=1) + labs(title = 'Age - 30-39') + 
  coord_polar('y', start=0) + 
  geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
  scale_fill_brewer(palette = 'Dark2') +
  theme_void()


pie3 = ggplot(p3, aes(x='', y=Proportion, fill =Smoked_100)) + 
  geom_bar(stat='identity', width=1) + labs(title = 'Age - 40-49') + 
  coord_polar('y', start=0) + 
  geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
  scale_fill_brewer(palette = 'Dark2') +
  theme_void()

pie4 = ggplot(p4, aes(x='', y=Proportion, fill =Smoked_100)) + 
  geom_bar(stat='identity', width=1) + labs(title = 'Age - 50-59') + 
  coord_polar('y', start=0) + 
  geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
  scale_fill_brewer(palette = 'Dark2') +
  theme_void()

grid.arrange(
  pie1, pie2, pie3, pie4, nrow=2, ncol = 2, top = 'Proportion of people who have smoked 100 cigarettes'
)

Below some box plots and scatter plots have been made influenced by the correlation plot. We hope to find some interetsing results from this. cor_func and grp_func was also used for futher testing. Based on the results some furhter hypothesis testing will be carried out.

#Box plots
b1 = ggplot(nhanes2, aes(x=agegroup, y=Systolic_BP1, color=Gender)) + 
  geom_boxplot()
b1

b1_test = filter(nhanes2, Age>69)
grp_func(b1_test, 'Gender', 'Systolic_BP1')
## # A tibble: 2 x 4
##   `data[[group_var]]`  Mean   Max   Min
##   <fct>               <dbl> <dbl> <dbl>
## 1 Female               140.   210    84
## 2 Male                 137.   204    92
##       Mean Max Min
## 1 138.2204 210  84
#Hypothesis test. 1 to see if men of all ages have higher Systolic BP than females, and 
#another to test if females aged 70+ have a higher Systolic BP than men. (test below)

#Filter dataframe to remove below value
df2 = filter(nhanes2, EducationX !='9')
b2 = ggplot(df2, aes(x=EducationX, y=Income_to_Pov, color=Gender)) + 
  geom_boxplot()
b2

nhanes2$Household_Size = as.factor(nhanes2$Household_Size)
b3 = ggplot(nhanes2, aes(x=Household_Size, y=Income_to_Pov, color=Gender)) + 
  geom_boxplot()
b3

#Hypothesis test to see if females living alone earn more than men living on their own 
#(test below)


b4 = ggplot(nhanes2, aes(x=Race, y=Income_to_Pov, color=Gender)) + 
  geom_boxplot()
b4

#Scatter plots
s1 = ggplot(nhanes2, aes(x=BMXWAIST, y=BMXBMI, color=Gender)) + 
  geom_point(alpha=0.2) +
  scale_color_manual(breaks = c('Male', 'Female'), values = c('blue', 'red'))
s1

corr_func('BMXWAIST', 'BMXBMI')
## [1] 0.910781
grp_func(nhanes2, 'Gender', 'BMXBMI')
## # A tibble: 2 x 4
##   `data[[group_var]]`  Mean   Max   Min
##   <fct>               <dbl> <dbl> <dbl>
## 1 Female               29.9  67.3  14.5
## 2 Male                 28.8  58.8  15.1
##      Mean  Max  Min
## 1 29.3822 67.3 14.5
#Filter dataframe to remove below values
df1 = filter(nhanes2, Smoked_100 !="Don't know" & Smoked_100 !='Refused')

s2 = ggplot(df1, aes(x=Systolic_BP1, y=Systolic_BP2, color = Smoked_100)) + 
  geom_point(alpha=0.2) + 
  scale_color_manual(breaks = c('Yes', 'No'), values = c('blue', 'red'))
s2

corr_func('Systolic_BP1', 'Systolic_BP2')
## [1] 0.9622873
s3 = ggplot(df1, aes(x=Systolic_BP1, y=BMXBMI, color = Smoked_100)) + 
  geom_point(alpha=0.2) + 
  scale_color_manual(breaks = c('Yes', 'No'), values = c('blue', 'red'))
s3

corr_func('Systolic_BP1', 'BMXBMI')
## [1] 0.1352012
s4 = ggplot(df1, aes(x=Age, y=Systolic_BP1, color = Smoked_100)) + 
  geom_point(alpha=0.2) + 
  scale_color_manual(breaks = c('Yes', 'No'), values = c('blue', 'red'))
s4

corr_func('Age', 'Systolic_BP1')
## [1] 0.4692335
grid.arrange(
  s1, s2, s3, s4, nrow=2, ncol = 2, top = 'title'
)

Shiny dashboard with interactive scatter plot and box plot. Interactive table using the prop_func

ui <- fluidPage()
server <- function(input, output) {} 



# Define UI
ui <- fluidPage(
  titlePanel("Nhanes Dashboard"),

  # Sidebar layout with a input and output definitions 
  sidebarLayout(
    
    # Inputs: Select variables to plot 
    sidebarPanel(
      
      # Select variable for y-axis 
      
      selectInput(inputId = "x", label = "Scatter - Variable in X axis:",
                  choices = c('Age', 'Systolic_BP1', 'Systolic_BP2',
                              'Diastolic_BP1', 'Diastolic_BP2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST'),
                  selected = "Systolic_BP1"), # Select variable for x-axis
      selectInput(inputId = "y", label = "Scatter - Variable in Y Axis",
                  choices = c('Age', 'Income_to_Pov', 'Systolic_BP1', 'Systolic_BP2',
                              'Diastolic_BP1', 'Diastolic_BP2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST'), 
                  selected = "Systolic_BP2"),
      
      sliderInput("range", 
                  label = "Scatter X interval:",
                  min = 0, max = 240, value = c(80, 240)),
      
      selectInput(inputId = "x2", label = "Box - Variable in X axis:",
                  choices = c('agegroup', "Alcohol_Year", "Smoked_100", "Race", "Education", "Marital_Status",
                              'Household_Size', 'Income_to_Pov'),
                  selected = "Household_Size"), # Select variable for x-axis
      selectInput(inputId = "y2", label = "Box - Variable in Y axis:",
                  choices = c('Systolic_BP1', 'Systolic_BP2', 'Diastolic_BP1', 'Diastolic_BP2', 
                              'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST', 'Income_to_Pov'), 
                  selected = "Income_to_Pov"),
      
      
      
      radioButtons("checkGroup", label = h3("Select data by:"), 
                         choices = list("Gender" = "Gender", "Age Group" = "agegroup",
                                        "Smoked 100" = "Smoked_100", "Alcohol Year" = "Alcohol_Year"),
                         selected = "Gender"),
    
    ),
    
    
    # Output 
    mainPanel(
      tabsetPanel(type="tab",
      tabPanel("Data", plotOutput(outputId = "scatterplot"),
      plotOutput(outputId = "boxplot")), 
      tabPanel("Summary of Nhanes dataset", verbatimTextOutput("summ")),
      tabPanel("Group table with proportions", radioButtons("Grp_prop", label = h3("Select data by group:"), 
                                                         choices = list("Age Group" = "agegroup",
                                                                        "Race" = "Race", "Education" = "Education", 
                                                                        "Marital Status" = "Marital_Status", "Household Size" = "Household_Size"),
                                                         selected = "agegroup"),
               radioButtons("analyse_prop", label = h3("Select data by:"), 
                            choices = list("Smoked 100" = "Smoked_100", "Alcohol Year" = "Alcohol_Year"),
                            selected = "Smoked_100"),dataTableOutput("prop"))),
      
  
      
      )
    ) 
  )

# Define server function
server <- function(input, output) {

  
  # Create the scatterplot object the plotOutput function is expecting 
  
  output$scatterplot <- renderPlot({
    nh3 = filter(nhanes2, !!sym(input$checkGroup) != "Don't know" & !!sym(input$checkGroup)!= "Refused")
    
    ggplot(data = nh3, aes_string(x = input$x, y = input$y, group = input$checkGroup,
                                      colour = input$checkGroup)) + geom_point(alpha = 0.5) +
                                      xlim(input$range[1], input$range[2])
      
      
    
    
  }) 
  output$boxplot <- renderPlot({
    nh3 = filter(nhanes2, !!sym(input$checkGroup) != "Don't know" & !!sym(input$checkGroup)!= "Refused" & !!sym(input$x2)!= "9" & !!sym(input$x2)!= "77" & !!sym(input$x2)!="Don't know" & !!sym(input$x2)!="Refused")
    nh3 = filter(nh3, !is.na(!!sym(input$checkGroup)))
    ggplot(data = nh3, aes_string(x = input$x2, y = input$y2,
                                      colour = input$checkGroup)) +geom_boxplot() 
  })
  
  # You can access the value of the widget with input$checkbox, e.g.
  output$value <- renderPrint({ input$checkGroup })
  
  output$range <- renderPrint({ input$slider })
  
  output$prop = renderDataTable({prop_func(nhanes2, input$Grp_prop, input$analyse_prop)})
  
  output$summ = renderPrint({summary(nhanes2)})
  
  
}
  
shinyApp(ui = ui, server = server)
## PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.
Shiny applications not supported in static R Markdown documents

Hypothesis testing - all below tests based on significance level of 0.05

T-test to see if male who has smoked 100 cigarettes in his live and had at least 12 alcoholic drinks in 1 year has a higher Systolic blood pressure than a male who has not. H0 != H1

test1_1 = filter(nhanes2, Gender=='Male' & Smoked_100=='Yes' & Alcohol_Year=='Yes')
test1_2 = filter(nhanes2, Gender=='Male' & Smoked_100=='No' & Alcohol_Year=='No')

#Test using the t.test function
t.test(test1_1$Systolic_BP1, test1_2$Systolic_BP1, alternative = "two.sided", 
       var.equal = FALSE, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  test1_1$Systolic_BP1 and test1_2$Systolic_BP1
## t = 4.6317, df = 692.33, p-value = 4.333e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.701071 6.676093
## sample estimates:
## mean of x mean of y 
##  128.8670  124.1784
#Test manually. As the std_dev on each sample are not similar, I have used the 
#unpooled approach
t1_n1 = nrow(test1_1)
t1_n2 = nrow(test1_2)
t1_u1 = mean(test1_1$Systolic_BP1, na.rm = TRUE)
t1_u2 = mean(test1_2$Systolic_BP1, na.rm = TRUE)
t1_sig1 = sd(test1_1$Systolic_BP1, na.rm = TRUE)
t1_sig2 = sd(test1_2$Systolic_BP1, na.rm = TRUE)

t_test = (t1_u1-t1_u2)/sqrt(((t1_sig1^2)/t1_n1)+((t1_sig2^2)/t1_n2))
t_test
## [1] 4.730432
#t-score for inbuilt_funtion = 4.6317, t-score worked out manually is 4.73. Both give a 
#p-value of <0.00001 which provides us with sufficient evidence to reject the null hypothesis 
#and say that a male who has drank and smoked does have a higher systolic blood pressure than
#a male who has not. We can futher back this up by looking at the confidence intervals 
#(2.7, 6.67) and seeing that 0 does not fall within this. Will run similar test for only the 
#alcohol part of this (below)

#T-test to see if somoen who has at least 12 alcoholic drinks in a year has a higher Systolic 
#BP than someone who has not.

test2_1 = filter(nhanes2, Gender=='Male' & Alcohol_Year=='Yes')
test2_2 = filter(nhanes2, Gender=='Male' & Alcohol_Year=='No')

#As the variance is similar (testing below), we will use thr pooled approach
sd(test2_1$Systolic_BP1, na.rm = TRUE)
## [1] 17.61412
sd(test2_2$Systolic_BP1, na.rm = TRUE)
## [1] 17.8955
t.test(test2_1$Systolic_BP1, test2_2$Systolic_BP1, alternative = 'two.sided', var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  test2_1$Systolic_BP1 and test2_2$Systolic_BP1
## t = 2.275, df = 2446, p-value = 0.02299
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2770041 3.7363882
## sample estimates:
## mean of x mean of y 
##  127.4336  125.4269
#P-value of 0.02299 still provides us with enough evidence to reject the null and conclude that
#'on average' people who drink 12 alcoholic drinkls per year have a higher systolic BP than
#those tho don't

#Hypothesis test 3 - Do men of all ages have higher Systolic BP than women.
test3_1 = filter(nhanes2, Gender == 'Male')
test3_2 = filter(nhanes2, Gender == 'Female')

#Testing variancwe below. As the variance differs, we will use the unpooled approach
sd(test3_1$Systolic_BP1, na.rm = TRUE)
## [1] 17.64247
sd(test3_2$Systolic_BP1, na.rm = TRUE)
## [1] 19.06579
t.test(test3_1$Systolic_BP1, test3_2$Systolic_BP1, alternative = 'two.sided', var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  test3_1$Systolic_BP1 and test3_2$Systolic_BP1
## t = 7.4453, df = 5397.1, p-value = 1.12e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.739755 4.698244
## sample estimates:
## mean of x mean of y 
##  126.9989  123.2799
#Hypothesis test to see if women >69 have a higher Systolic BP than men of the similar age
test4_1 = filter(nhanes2, Gender == 'Female' & Age > 69)
test4_2 = filter(nhanes2, Gender == 'Male' & Age > 69)

#Testing variancwe below. As the variance is similar, the pooled approach will be used
sd(test4_1$Systolic_BP1, na.rm = TRUE)
## [1] 20.7647
sd(test4_2$Systolic_BP1, na.rm = TRUE)
## [1] 20.46368
t.test(test4_1$Systolic_BP1, test4_2$Systolic_BP1, alternative = 'two.sided', var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  test4_1$Systolic_BP1 and test4_2$Systolic_BP1
## t = 2.4618, df = 851, p-value = 0.01402
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7044388 6.2458476
## sample estimates:
## mean of x mean of y 
##  139.9763  136.5012
#Hypothesis test to see if females living alone earn more than males living alone.
test5_1 = filter(nhanes2, Gender == 'Female' & Household_Size == '1')
test5_2 = filter(nhanes2, Gender == 'Male' & Household_Size == '1')

#Signifincant difference in variance, so we will used the unpooled approach
sd(test5_1$Income_to_Pov, na.rm = TRUE)
## [1] 1.550615
sd(test5_2$Income_to_Pov, na.rm = TRUE)
## [1] 1.598473
t.test(test5_1$Income_to_Pov, test5_2$Income_to_Pov, alternative = 'two.sided', var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  test5_1$Income_to_Pov and test5_2$Income_to_Pov
## t = 0.11781, df = 685.79, p-value = 0.9063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2201615  0.2482691
## sample estimates:
## mean of x mean of y 
##  2.279837  2.265783
#With a p-value of 0.9063, we fail to reject the null hypothesis based on a significance 
#level of 0.05. Confidence level also includes 0.